feat(sparserl-sync): integrate SparseRL-Sync sparse diff tracking into optimizer #5
Open
tonic-scitix wants to merge 1 commit into
…o optimizer

Wrap the in-place param.copy_() calls in DistributedOptimizer and HybridDeviceOptimizer with sparse_diff_context() so the attached SparseManager can snapshot pre/post state and build per-param sparse-update indices for the Trainer→Rollout weight broadcast.

Also pass the communication group to P2POp constructors in p2p_communication.py (group was omitted; required for non-default PG usage).

Co-Authored-By: Claude <noreply@anthropic.com>
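
The snapshot-and-diff idea in the commit message can be pictured with a small, dependency-free sketch. This is not the `sparse_update` implementation: `toy_sparse_diff_context`, its `changed_out` argument, and the use of plain Python lists in place of tensor shards are all hypothetical stand-ins chosen for illustration.

```python
from contextlib import contextmanager


@contextmanager
def toy_sparse_diff_context(target, changed_out):
    """Hypothetical stand-in for a sparse diff context (not the real API).

    Snapshots `target` before the wrapped block runs, then records the
    indices whose values differ once the block has finished.
    """
    before = list(target)  # pre-copy snapshot
    yield
    # Post-copy comparison: keep only the positions that actually changed.
    changed_out.extend(
        i for i, (old, new) in enumerate(zip(before, target)) if old != new
    )


# Plain lists stand in for the model-param shard and the main-param shard.
model_shard = [1.0, 2.0, 3.0, 4.0]
main_shard = [1.0, 2.5, 3.0, 3.5]
changed = []

with toy_sparse_diff_context(model_shard, changed):
    # Stands in for shard_model_param.data.copy_(shard_main_param).
    model_shard[:] = main_shard

print(changed)  # indices that differ between the pre- and post-copy state
```

Only the changed indices (and their new values) would then need to be sent to the receiver, which is the motivation for tracking the diff at the copy site rather than re-scanning the full parameter afterwards.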
What does this PR do?
Instrument the Megatron optimizer's in-place weight-copy sites so the
SparseRL-Sync package can compute per-param sparse-update indices for each
rollout. When `sparse_update` is not installed, the imports fall back to
`nullcontext` no-ops and the upstream behaviour is fully preserved.

This PR is one of three coordinated changes:
Modifications
`megatron/core/optimizer/distrib_optimizer.py`
- Guarded `from sparse_update import init_sparse_manager, sparse_diff_context`; falls back to `nullcontext` / no-op when not installed.
- `_build_model_and_main_param_groups`: calls `init_sparse_manager(model_param, shard_model_weight, shard_main_weight, param_range)` for both fp16/bf16 and fp32 param groups so the `SparseManager` knows which DP-local buffer offset maps to which model param.
- `_copy_main_params_to_model_params`: wraps `shard_model_param.data.copy_(shard_main_param)` with `sparse_diff_context(shard_model_param, shard_main_param)` so the manager snapshots pre/post state and emits per-param sparse indices.

`megatron/core/optimizer/cpu_offloading/hybrid_optimizer.py`
- Guarded `from sparse_update import sparse_diff_context`.
- Wraps the copy sites (`param_copy_back_gpu_hook` H2D copy and `fp32_param_copy_back_gpu_hook` fp32→fp16 copy) with `sparse_diff_context(target, source)`.

`megatron/core/pipeline_parallel/p2p_communication.py`
- Passes the `group` positional argument to all four `torch.distributed.P2POp(...)` constructors. It was previously omitted, which causes deadlocks when SparseRL-Sync's NCCL group overlaps with the world group during sparse weight-sync.

Pre-checks
Notes for reviewers
- The `import` blocks are `try/except ImportError`; removing the `sparse_update` package from the environment restores byte-identical upstream behaviour with zero code changes.
- `init_sparse_manager` is a one-time bind per shard slice; it does not allocate any GPU memory itself.
- `sparse_diff_context` is a lightweight CUDA-stream-safe context manager; it does not add synchronisation points beyond the existing stream ordering.
- The `p2p_communication.py` fix is independent of SparseRL-Sync and is safe to cherry-pick separately.
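
For reviewers who want to picture the guarded import, a minimal sketch of the pattern follows. The exact fallback code in this PR is not reproduced here; the wrapper functions below are an assumption about how a `nullcontext` / no-op fallback could be shaped so that call sites can still pass `(target, source)` arguments, and `copy_with_tracking` is a hypothetical helper using plain lists instead of tensors.

```python
from contextlib import nullcontext

try:
    # Real package, if present in the environment.
    from sparse_update import init_sparse_manager, sparse_diff_context
except ImportError:
    # Assumed fallback shape: accept and ignore whatever the call sites pass.
    def init_sparse_manager(*args, **kwargs):
        """No-op when sparse_update is not installed."""
        return None

    def sparse_diff_context(*args, **kwargs):
        """Return a context manager that does nothing on enter or exit."""
        return nullcontext()


def copy_with_tracking(target, source):
    """Hypothetical call site: the copy itself is unchanged by the wrapper."""
    with sparse_diff_context(target, source):
        target[:] = source  # stands in for target.data.copy_(source)


buf = [0.0, 0.0, 0.0]
copy_with_tracking(buf, [1.0, 2.0, 3.0])
print(buf)
```

Wrapping `nullcontext` in a function rather than aliasing it directly matters in this sketch because `nullcontext` accepts at most one positional argument, while the call sites pass two.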